PALSSN.002C1 



METHODS FOR IDENTIFYING DRUG TARGETS 
BASED ON GENOMIC SEQUENCE DATA 



PATENT 



Background of the Invention 

Related Applications 

This application is a continuation of Application Number 09/243,022, filed February 2, 1999.



Field of the Invention 

This invention relates to methods for identifying drug targets based on genomic sequence 
data. More specifically, this invention relates to systems and methods for determining suitable 
molecular targets for the directed development of antimicrobial agents. 
Description of the Related Art 

Infectious disease is on a rapid rise and threatens to regain its status as a major health 
problem. Prior to the discovery of antibiotics in the 1930s, infectious disease was a major cause 
of death. Further discoveries, development, and mass production of antibiotics throughout the 
1940s and 1950s dramatically reduced deaths from microbial infections to a level where they 
effectively no longer represented a major threat in developed countries. 

Over the years antibiotics have been liberally prescribed and the strong selection pressure 
that this represents has led to the emergence of antibiotic resistant strains of many serious human 
pathogens. In some cases selected antibiotics, such as vancomycin, literally represent the last line 
of defense against certain pathogenic bacteria such as Staphylococcus. The possibility for 
staphylococci to acquire vancomycin resistance through exchange of genetic material with 
enterococci, which are commonly resistant to vancomycin, is a serious issue of concern to health 
care specialists. The pharmaceutical industry continues its search for new antimicrobial 
compounds, which is a lengthy and tedious, but very important process. The rate of development 
and introduction of new antibiotics appears to no longer be able to keep up with the evolution of 
new antibiotic resistant organisms. The rapid emergence of antibiotic resistant organisms 
threatens to lead to a serious widespread health care concern. 



The basis of antimicrobial chemotherapy is to selectively kill the microbe with minimal, 
and ideally no, harm to normal human cells and tissues. Therefore, ideal targets for antibacterial 
action are biochemical processes that are unique to bacteria, or those that are sufficiently 
different from the corresponding mammalian processes to allow acceptable discrimination 
between the two. For effective antibiotic action it is clear that a vital target must exist in the 
bacterial cell and that the antibiotic be delivered to the target in an active form. Therefore 
resistance to an antibiotic can arise from: (i) chemical destruction or inactivation of the 
antibiotic; (ii) alteration of the target site to reduce or eliminate effective antibiotic binding; (iii) 
blocking antibiotic entry into the cell, or rapid removal from the cell after entry; and (iv) 
replacing the metabolic step inhibited by the antibiotic. 

Thus, it is time to fundamentally re-examine the philosophy of microbial killing 
strategies and develop new paradigms. One such paradigm is a holistic view of cellular 
metabolism. The "sensitive" metabolic steps that can be attacked to weaken or destroy a
microbe, that is, the steps needed to attain the metabolic flux distributions that support growth
and survival, need not be localized to a single biochemical reaction or cellular process.
Rather, different cellular targets that need not be intimately related in the metabolic topology 
could be chosen based on the concerted effect the loss of each of these functions would have on 
metabolism. 

A similar strategy with viral infections has recently proved successful. It has been shown 
that "cocktails" of different drugs that target different biochemical processes provide enhanced 
success in fighting against HIV infection. Such a paradigm shift is possible only if the necessary 
biological information as well as appropriate methods of rational analysis are available. Recent 
advances in the field of genomics and bioinformatics, in addition to mathematical modeling, 
offer the possibility to realize this approach. 

At present, the field of microbial genetics is entering a new era where the genomes of 
several microorganisms are being completely sequenced. It is expected that in a decade, or so, 
the nucleotide sequences of the genomes of all the major human pathogens will be completely 
determined. The sequencing of the genomes of pathogens such as Haemophilus influenzae has 
allowed researchers to compare the homology of proteins encoded by the open reading frames 
(ORFs) with those of Escherichia coli, resulting in valuable insight into the H. influenzae 



metabolic features. Similar analyses, such as those performed with H. influenzae, will provide 
details of metabolism spanning the hierarchy of metabolic regulation from bacterial genomes to 
phenotypes. 

These developments provide exciting new opportunities to carry out conceptual 
experiments in silico to analyze different aspects of microbial metabolism and its regulation. 
Further, the synthesis of whole-cell models is made possible. Such models can account for each 
and every single metabolic reaction and thus enable the analysis of their role in overall cell 
function. To implement such analysis, however, a mathematical modeling and simulation 
framework is needed which can incorporate the extensive metabolic detail but still retain 
computational tractability. Fortunately, rigorous and tractable mathematical methods have been 
developed for the required systems analysis of metabolism. 

A mathematical approach that is well suited to account for genomic detail and avoid 
reliance on kinetic complexity has been developed based on well-known stoichiometry of 
metabolic reactions. This approach is based on metabolic flux balancing in a metabolic steady 
state. The history of flux balance models for metabolic analyses is relatively short. It has been 
applied to metabolic networks, and the study of adipocyte metabolism. Acetate secretion from E. 
coli under ATP maximization conditions and ethanol secretion by yeast have also been 
investigated using this approach. 

The complete sequencing of a bacterial genome and ORF assignment provides the 
information needed to determine the relevant metabolic reactions that constitute metabolism in a 
particular organism. Thus a flux-balance model can be formulated and several metabolic 
analyses can be performed to extract metabolic characteristics for a particular organism. The flux 
balance approach can be easily applied to systematically simulate the effect of single, as well as 
multiple, gene deletions. This analysis will provide a list of sensitive enzymes that could be 
potential antimicrobial targets. 

The emergence of antibiotic resistant pathogens is a problem of vital importance, and
dealing with it requires a new paradigm. The route towards the design of new
antimicrobial agents must proceed along directions that are different from those of the past. The 
rapid growth in bioinformatics has provided a wealth of biochemical and genetic information 
that can be used to synthesize complete representations of cellular metabolism. These models can 



be analyzed with relative computational ease through flux-balance models and visual computing 
techniques. The ability to analyze the global metabolic network and understand the robustness 
and sensitivity of its regulation under various growth conditions offers promise in developing 
novel methods of antimicrobial chemotherapy. 

In one example, Pramanik et al. described a stoichiometric model of E. coli metabolism
using flux-balance modeling techniques (Stoichiometric Model of Escherichia coli Metabolism:
Incorporation of Growth-Rate Dependent Biomass Composition and Mechanistic Energy
Requirements, Biotechnology and Bioengineering, Vol. 56, No. 4, November 20, 1997).
However, the analytical methods described by Pramanik et al. can only be used for situations in
which biochemical knowledge exists for the reactions occurring within an organism. Pramanik
et al. produced a metabolic model for E. coli based on biochemical information
rather than genomic data, since the metabolic genes and related reactions for E. coli had already
been well studied and characterized. Thus, this method is inapplicable to determining a
metabolic model for organisms for which little or no biochemical information on metabolic
enzymes and genes is known. It can be envisioned that in the future the only information we
may have regarding an emerging pathogen is its genomic sequence. What is needed in the art is
a system and method for determining and analyzing the entire metabolic network of organisms
whose metabolic reactions have not yet been determined from biochemical assays. The present
invention provides such a system.



Summary of the Invention 
This invention relates to constructing metabolic genotypes and genome specific 
stoichiometric matrices from genome annotation data. The functions of the metabolic genes in 
the target organism are determined by homology searches against databases of genes from 
similar organisms. Once a potential function is assigned to each metabolic gene of the target 
organism, the resulting data is analyzed. In one embodiment, each gene is subjected to a flux- 
balance analysis to assess the effects of genetic deletions on the ability of the target organism to 
produce essential biomolecules necessary for its growth. Thus, the invention provides a high- 
throughput computational method to screen for genetic deletions which adversely affect the 
growth capabilities of fully sequenced organisms. 



Embodiments of this invention also provide a computational, as opposed to an 
experimental, method for the rapid screening of genes and their gene products as potential drug 
targets to inhibit an organism's growth. This invention utilizes the genome sequence, the 
annotation data, and the biomass requirements of an organism to construct genomically complete 
metabolic genotypes and genome-specific stoichiometric matrices. These stoichiometric 
matrices are analyzed using a flux-balance analysis. This invention describes how to assess the
effects of genetic deletions on the fitness and productive capabilities of an organism under given
environmental and genetic conditions.

Construction of a genome-specific stoichiometric matrix from genomic annotation data is 
illustrated along with applying flux-balance analysis to study the properties of the stoichiometric 
matrix, and hence the metabolic genotype of the organism. By limiting the constraints on
various fluxes and altering the environmental inputs to the metabolic network, genetic deletions
may be analyzed for their effects on growth. This invention is embodied in a software
application that can be used to create the stoichiometric matrix for a fully sequenced and 
annotated genome. Additionally, the software application can be used to further analyze and 
manipulate the network so as to predict the ability of an organism to produce biomolecules 
necessary for growth, thus, essentially simulating a genetic deletion. 

Brief Description of the Drawings 

Figure 1 is a flow diagram illustrating one procedure for creating metabolic genotypes 
from genomic sequence data for any organism. 

Figure 2 is a flow diagram illustrating one procedure for producing in silico microbial 
strains from the metabolic genotypes created by the method of Figure 1, along with additional 
biochemical and microbiological data. 

Figure 3 is a graph illustrating a prediction of genome scale shifts in transcription. The 
graph shows the different phases of the metabolic response to varying oxygen availability, 
starting from completely aerobic to completely anaerobic in E. coli. The predicted changes in 
expression pattern between phases II and V are indicated. 



Detailed Description of the Invention 
This invention relates to systems and methods for utilizing genome annotation data to
construct a stoichiometric matrix representing most or all of the metabolic reactions that occur
within an organism. Using these systems and methods, the properties of this matrix can be
studied under conditions simulating genetic deletions in order to predict the effect of a particular
gene on the fitness of the organism. Moreover, genes that are vital to the growth of an organism
can be found by selectively removing various genes from the stoichiometric matrix and 
thereafter analyzing whether an organism with this genetic makeup could survive. Analysis of 
these lethal genetic mutations is useful for identifying potential genetic targets for anti-microbial 
drugs. 

It should be noted that the systems and methods described herein can be implemented on 
any conventional host computer system, such as those based on Intel® microprocessors and 
running Microsoft Windows operating systems. Other systems, such as those using the UNIX or 
LINUX operating system and based on IBM®, DEC® or Motorola® microprocessors are also 
contemplated. The systems and methods described herein can also be implemented to run on 
client-server systems and wide-area networks, such as the Internet. 

Software to implement the system can be written in any well-known computer language, 
such as Java, C, C++, Visual Basic, FORTRAN or COBOL and compiled using any well-known 
compatible compiler. 

The software of the invention normally runs from instructions stored in a memory on the 
host computer system. Such a memory can be a hard disk, Random Access Memory, Read Only
Memory or Flash Memory. Other types of memories are also contemplated to function within
the scope of the invention. 

A process 10 for producing metabolic genotypes from an organism is shown in Figure 1. 
Beginning at a start state 12, the process 10 then moves to a state 14 to obtain the genomic DNA 
sequence of an organism. The nucleotide sequence of the genomic DNA can be rapidly 
determined for an organism with a genome size on the order of a few million base pairs. One 
method for obtaining the nucleotide sequences in a genome is through commercial gene 
databases. Many gene sequences are available on-line through a number of sites (see, for 
example, www.tigr.org ) and can easily be downloaded from the Internet. Currently, there are 16 






microbial genomes that have been fully sequenced and are publicly available, with countless 
others held in proprietary databases. It is expected that a number of other organisms, including
pathogenic organisms, will be found in nature for which little experimental information, apart
from their genome sequences, will be available.

Once the nucleotide sequence of the entire genomic DNA in the target organism has been 
obtained at state 14, the coding regions, also known as open reading frames, are determined at a 
state 16. Using existing computer algorithms, the location of open reading frames that encode 
genes from within the genome can be determined. For example, to identify the proper location, 
strand, and reading frame of an open reading frame one can perform a gene search by signal 
(promoters, ribosomal binding sites, etc.) or by content (positional base frequencies, codon 
preference). Computer programs for determining open reading frames are available, for 
example, by the University of Wisconsin Genetics Computer Group and the National Center for 
Biotechnology Information. 
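
The search by content described for state 16 can be illustrated with a very small sketch. It is not the Genetics Computer Group or NCBI software named above. It assumes only the standard ATG start codon and the three standard stop codons, scans all six reading frames, and ignores promoter, ribosomal binding site, and codon preference evidence. The names find_orfs and min_codons are hypothetical.

```python
# Naive six-frame open reading frame scan (illustrative sketch only).
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end) for each ATG...stop stretch.

    start and end are 0-based offsets on the scanned strand; end is
    exclusive and includes the stop codon. min_codons counts the codons
    from the ATG up to, but not including, the stop codon.
    """
    orfs = []
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first start codon in this frame
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs

demo = "CCATGAAACCCGGGTAACC"
print(find_orfs(demo))
```

A production gene finder would add the signal and content statistics mentioned above; this sketch only shows how location, strand, and reading frame are reported for each candidate.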

After the locations of the open reading frames have been determined at state 16, the
process 10 moves to state 18 to assign a function to the protein encoded by each open reading
frame. The discovery that an open reading frame or gene has sequence homology to a gene
coding for a protein of known function, or family of proteins of known function, can provide the
first clues about the function of the gene and its related protein. After the locations of the open
reading frames have been determined in the genomic DNA from the target organism,
well-established algorithms (e.g., the Basic Local Alignment Search Tool (BLAST) and the FAST
family of programs) can be used to determine the extent of similarity between a given sequence
and gene/protein sequences deposited in worldwide genetic databases. If a coding region from a
gene in the target organism is homologous to a gene within one of the sequence databases, the 
open reading frame is assigned a function similar to the homologously matched gene. Thus, the 
functions of nearly the entire gene complement or genotype of an organism can be determined so 
long as homologous genes have already been discovered. 
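
The assignment step of state 18 can be sketched as a selection of the best database hit for each open reading frame. The sketch below assumes the homology search has already been run and its results collected as records. The record layout, the identifiers, the E-values, and the cutoff are all made-up examples rather than output of BLAST or any other program.

```python
# Assign a putative function to each ORF from its best database hit.
# All identifiers, E-values, and the cutoff below are hypothetical.
from collections import namedtuple

Hit = namedtuple("Hit", "orf subject evalue function")

def assign_functions(hits, max_evalue=1e-10):
    """Keep, per ORF, the function of the lowest E-value hit under the cutoff."""
    best = {}
    for h in hits:
        if h.evalue > max_evalue:
            continue  # too weak to support a functional assignment
        if h.orf not in best or h.evalue < best[h.orf].evalue:
            best[h.orf] = h
    return {orf: h.function for orf, h in best.items()}

hits = [
    Hit("orf001", "db_entry_1", 1e-80, "glucose-6-phosphate isomerase"),
    Hit("orf001", "db_entry_2", 1e-12, "hypothetical protein"),
    Hit("orf002", "db_entry_3", 1e-65, "6-phosphofructokinase"),
    Hit("orf003", "db_entry_4", 0.5, "transporter"),  # rejected: above cutoff
]
print(assign_functions(hits))
```

Open reading frames with no hit under the cutoff, such as orf003 here, simply remain unassigned, which mirrors the caveat that functions can be determined only where homologous genes have already been discovered.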

All of the genes involved in metabolic reactions and functions in a cell comprise only a 
subset of the genotype. This subset of genes is referred to as the metabolic genotype of a 
particular organism. Thus, the metabolic genotype of an organism includes most or all of the 
genes involved in the organism's metabolism. The gene products produced from the set of 



metabolic genes in the metabolic genotype carry out all or most of the enzymatic reactions and 
transport reactions known to occur within the target organism as determined from the genomic 
sequence. 

To begin the selection of this subset of genes, one can simply search through the list of 
functional gene assignments from state 18 to find genes involved in cellular metabolism. This 
would include genes involved in central metabolism, amino acid metabolism, nucleotide 
metabolism, fatty acid and lipid metabolism, carbohydrate assimilation, vitamin and cofactor 
biosynthesis, energy and redox generation, etc. This subset is generated at a state 20. The 
process 10 of determining metabolic genotype of the target organism from genomic data then 
terminates at an end stage 22. 

Referring now to Figure 2, a process 50 of producing a computer model of an
organism is shown. This process is also known as producing in silico microbial strains. The process 50
begins at a start state 52 (same as end state 22 of process 10) and then moves to a state 54 
wherein biochemical information is gathered for the reactions performed by each metabolic 
gene product for each of the genes in the metabolic genotype determined from process 10. 

For each gene in the metabolic genotype, the substrates and products, as well as the 
stoichiometry of any and all reactions performed by the gene product of each gene can be 
determined by reference to the biochemical literature. This includes information regarding the 
irreversible or reversible nature of the reactions. The stoichiometry of each reaction provides 
the molecular ratios in which reactants are converted into products. 

Potentially, there may still remain a few reactions in cellular metabolism which are
known to occur from in vitro assays and experimental data but which are not represented in the
metabolic genotype. These would include well characterized reactions for which a gene or
protein has yet to be identified, or was unidentified
from the genome sequencing and functional assignment of states 14 and 18. This would also
include the transport of metabolites into or out of the cell by uncharacterized genes related to
transport. Thus one reason for the missing gene information may be a lack of
characterization of the actual gene that performs a known biochemical conversion. Therefore,
upon careful review of existing biochemical literature and available experimental data,
additional metabolic reactions can be added to the list of metabolic reactions determined from



the metabolic genotype from state 54 at a state 56. This would include information regarding 
the substrates, products, reversibility/irreversibility, and stoichiometry of the reactions. 

All of the information obtained at states 54 and 56 regarding reactions and their 
stoichiometry can be represented in a matrix format typically referred to as a stoichiometric 
matrix. Each column in the matrix corresponds to a given reaction or flux, and each row 
corresponds to the different metabolites involved in the given reaction/flux. Reversible 
reactions may either be represented as one reaction that operates in both the forward and reverse 
direction or be decomposed into one forward reaction and one backward reaction in which case 
all fluxes can only take on positive values. Thus, a given position in the matrix describes the 
stoichiometric participation of a metabolite (listed in the given row) in a particular flux of 
interest (listed in the given column). Together all of the columns of the genome specific 
stoichiometric matrix represent all of the chemical conversions and cellular transport processes 
that are determined to be present in the organism. This includes all internal fluxes and so called 
exchange fluxes operating within the metabolic network. Thus, the process 50 moves to a state 
58 in order to formulate all of the cellular reactions together in a genome specific stoichiometric 
matrix. The resulting genome specific stoichiometric matrix is a fundamental representation of 
a genomically and biochemically defined genotype. 
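
The matrix layout described above can be made concrete with a toy network. The three metabolites and four reactions below are invented for illustration and are not taken from Table 1. The sketch follows the stated convention of one row per metabolite and one column per flux, and it shows the optional decomposition of a reversible reaction into forward and backward columns.

```python
# Build a stoichiometric matrix from a reaction list (toy network only).
# Rows are metabolites, columns are fluxes.
# Negative coefficient = consumed, positive coefficient = produced.
reactions = {
    # name: (stoichiometry, reversible?)
    "uptake_A": ({"A": 1}, False),            # exchange flux bringing A into the cell
    "R1":       ({"A": -1, "B": 1}, False),   # A -> B
    "R2":       ({"B": -1, "C": 2}, True),    # B <-> 2 C
    "export_C": ({"C": -1}, False),           # exchange flux removing C
}

def build_matrix(reactions, split_reversible=True):
    """Return (metabolites, flux_names, S) with S as a list of rows."""
    columns = []
    for name, (stoich, reversible) in reactions.items():
        if reversible and split_reversible:
            # One forward and one backward column, so all fluxes stay non-negative.
            columns.append((name + "_fwd", stoich))
            columns.append((name + "_rev", {m: -c for m, c in stoich.items()}))
        else:
            columns.append((name, stoich))
    metabolites = sorted({m for _, stoich in columns for m in stoich})
    S = [[stoich.get(m, 0) for _, stoich in columns] for m in metabolites]
    return metabolites, [name for name, _ in columns], S

mets, fluxes, S = build_matrix(reactions)
for m, row in zip(mets, S):
    print(m, row)
```

Calling build_matrix(reactions, split_reversible=False) keeps R2 as a single column whose flux may take either sign, which is the other representation mentioned in the text.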

After the genome specific stoichiometric matrix is defined at state 58, the metabolic 
demands placed on the organism are calculated. The metabolic demands can be readily 
determined from the dry weight composition of the cell. In the case of well-studied organisms 
such as Escherichia coli and Bacillus subtilis, the dry weight composition is available in the 
published literature. However, in some cases it will be necessary to experimentally determine 
the dry weight composition of the cell for the organism in question. This can be accomplished 
with varying degrees of accuracy. The first attempt would measure the RNA, DNA, protein, 
and lipid fractions of the cell. A more detailed analysis would also provide the specific fraction 
of nucleotides, amino acids, etc. The process 50 moves to state 60 for the determination of the 
biomass composition of the target organism. 

The process 50 then moves to state 62 to perform several experiments that determine the 
uptake rates and maintenance requirements for the organism. Microbiological experiments can 
be carried out to determine the uptake rates for many of the metabolites that are transported into 



the cell. The uptake rate is determined by measuring the depletion of the substrate from the 
growth media. The measurement of the biomass at each point is also required, in order to 
determine the uptake rate per unit biomass. The maintenance requirements can be determined 
from a chemostat experiment. The glucose uptake rate is plotted versus the growth rate, and the 
y-intercept is interpreted as the non-growth associated maintenance requirements. The growth 
associated maintenance requirements are determined by fitting the model results to the 
experimentally determined points in the growth rate versus glucose uptake rate plot. 
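
The y-intercept estimate can be sketched with an ordinary least-squares line through chemostat data points. The four data points below are invented so that they fall exactly on a line, and the units in the comments are assumptions. Real measurements would scatter around the fit.

```python
# Estimate the non-growth associated maintenance requirement as the
# y-intercept of glucose uptake rate plotted against growth rate.
# The data points are hypothetical and chosen to lie on y = 10 x + 0.5.

def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

growth_rate = [0.1, 0.2, 0.3, 0.4]        # assumed units: 1/h
glucose_uptake = [1.5, 2.5, 3.5, 4.5]     # assumed units: mmol per g dry weight per h

slope, maintenance = fit_line(growth_rate, glucose_uptake)
print("uptake at zero growth (y-intercept):", round(maintenance, 6))
print("uptake per unit growth rate (slope):", round(slope, 6))
```

The intercept is the substrate uptake that remains when growth is extrapolated to zero, which is how the text interprets the non-growth associated maintenance requirement.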

Next, the process 50 moves to a state 64 wherein the information regarding the metabolic
demands and uptake rates obtained at states 60 and 62 is combined with the genome specific
stoichiometric matrix of state 58 to fully define the metabolic system using flux balance
analysis (FBA). This is an approach well suited to account for genomic detail as it has been
developed based on the well-known stoichiometry of metabolic reactions.

The time constants characterizing metabolic transients and/or metabolic reactions are typically
very rapid, on the order of milliseconds to seconds, compared to the time constants of cell
growth on the order of hours to days. Thus, the transient mass balances can be simplified to 
only consider the steady state behavior. Eliminating the time derivatives obtained from dynamic 
mass balances around every metabolite in the metabolic system, yields the system of linear 
equations represented in matrix notation, 

S • v = 0 Equation 1 

where S refers to the stoichiometric matrix of the system, and v is the flux vector. This equation 
simply states that over long times, the formation fluxes of a metabolite must be balanced by the 
degradation fluxes. Otherwise, significant amounts of the metabolite will accumulate inside the 
metabolic network. Applying Equation 1 to our system, we let S now represent the genome
specific stoichiometric matrix.
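
The step from the dynamic mass balances to Equation 1 can be written out as follows. The symbol X, for the vector of metabolite concentrations, is an assumption introduced here for illustration. S and v are as defined above.

```latex
% Dynamic mass balance on metabolite i, then the whole system in matrix form
\frac{dX_i}{dt} = \sum_j S_{ij}\, v_j
\qquad\Longrightarrow\qquad
\frac{dX}{dt} = S \cdot v

% Steady state: the time derivatives are eliminated
\frac{dX}{dt} = 0
\qquad\Longrightarrow\qquad
S \cdot v = 0

% Example row: a metabolite formed by flux v_1 and degraded by v_2 and v_3
v_1 - v_2 - v_3 = 0
```

The example row restates the verbal reading of Equation 1: the formation flux of the metabolite is balanced by the sum of its degradation fluxes.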

To determine the metabolic capabilities of a defined metabolic genotype, Equation 1 is
solved for the metabolic fluxes and the internal metabolic reactions, v, while imposing
constraints on the activity of these fluxes. Typically the number of metabolic fluxes, m, is greater
than the number of mass balances, n (i.e., m > n), resulting in a plurality of feasible flux
distributions that satisfy Equation 1 and any constraints placed on the fluxes of the system. This
range of solutions is indicative of the flexibility in the flux distributions that can be achieved 






with a given set of metabolic reactions. The solutions to Equation 1 lie in a restricted region. 
This subspace defines the capabilities of the metabolic genotype of a given organism, since the 
allowable solutions that satisfy Equation 1 and any constraints placed on the fluxes of the 
system define all the metabolic flux distributions that can be achieved with a particular set of 
metabolic genes. 
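
The plurality of solutions can be quantified for a toy case. When there are more fluxes than independent mass balances, the difference between the two counts is the number of degrees of freedom left in the flux distribution before any bounds are applied. The network below is invented for illustration, and the helper names rank and is_steady are hypothetical.

```python
# Count the degrees of freedom left by S v = 0 for a toy network.
from fractions import Fraction

def rank(matrix):
    """Rank of a matrix by exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue                               # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

def is_steady(S, v):
    """True if the flux vector v satisfies every mass balance in S."""
    return all(sum(c * x for c, x in zip(row, v)) == 0 for row in S)

# Columns: uptake_A, A->B, A->C, export_B, export_C
S = [
    [1, -1, -1,  0,  0],   # A
    [0,  1,  0, -1,  0],   # B
    [0,  0,  1,  0, -1],   # C
]
n_fluxes = len(S[0])
degrees_of_freedom = n_fluxes - rank(S)
print("independent mass balances:", rank(S))
print("degrees of freedom:", degrees_of_freedom)

# Two different flux distributions that both satisfy the balances.
print(is_steady(S, [1, 1, 0, 1, 0]), is_steady(S, [1, 0, 1, 0, 1]))
```

Here five fluxes and three independent balances leave two degrees of freedom: all of A can be routed through B, all through C, or any mixture of the two. Constraints on the fluxes then restrict this set further.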

The particular utilization of the metabolic genotype can be defined as the metabolic 
phenotype that is expressed under those particular conditions. Objectives for metabolic function 
can be chosen to explore the 'best' use of the metabolic network within a given metabolic 
genotype. The solution to Equation 1 can be formulated as a linear programming problem, in
which the flux distribution that minimizes a particular objective is found. Mathematically, this
optimization can be stated as:

Minimize Z                                                    Equation 2

where $Z = \sum_i c_i v_i = \langle c \cdot v \rangle$        Equation 3

where Z is the objective, which is represented as a linear combination of metabolic fluxes $v_i$. The
optimization can also be stated as the equivalent maximization problem, i.e., by changing the sign
on Z.

This general representation of Z enables the formulation of a number of diverse 
objectives. These objectives can be design objectives for a strain, exploitation of the metabolic 
capabilities of a genotype, or physiologically meaningful objective functions, such as maximum 
cellular growth. For this application, growth is to be defined in terms of biosynthetic 
requirements based on literature values of biomass composition or experimentally determined 
values such as those obtained from state 60. Thus, we can define biomass generation as an 
additional reaction flux draining intermediate metabolites in the appropriate ratios and 
represented as an objective function Z. In addition to draining intermediate metabolites this 
reaction flux can be formed to utilize energy molecules such as ATP, NADH and NADPH so as 
to incorporate any maintenance requirement that must be met. This new reaction flux then 
becomes another constraint/balance equation that the system must satisfy as the objective 
function. It is analogous to adding an additional column to the stoichiometric matrix of Equation 1
to represent such a flux to describe the production demands placed on the metabolic system.






Setting this new flux as the objective function and asking the system to maximize the value of 
this flux for a given set of constraints on all the other fluxes is then a method to simulate the 
growth of the organism. 

Using linear programming, additional constraints can be placed on the value of any of the 
fluxes in the metabolic network. 

$\beta_j \le v_j \le \alpha_j$                                Equation 4

These constraints could be representative of a maximum allowable flux through a given
reaction, possibly resulting from a limited amount of an enzyme present, in which case the value
for $\alpha_j$ would take on a finite value. These constraints could also be used to include the
knowledge of the minimum flux through a certain metabolic reaction, in which case the value for
$\beta_j$ would take on a finite value. Additionally, if one chooses to leave certain reversible reactions
or transport fluxes to operate in a forward and reverse manner, the flux may remain unconstrained
by setting $\beta_j$ to negative infinity and $\alpha_j$ to positive infinity. If reactions proceed only in the
forward direction, $\beta_j$ is set to zero while $\alpha_j$ is set to positive infinity. As an example, to simulate
the event of a genetic deletion, the fluxes through all of the corresponding metabolic reactions
related to the gene in question are reduced to zero by setting $\beta_j$ and $\alpha_j$ to zero in Equation 4.
Based on the in vivo environment where the bacterium lives, one can determine the metabolic
resources available to the cell for biosynthesis of essential molecules for biomass. Allowing
the corresponding transport fluxes to be active provides the in silico bacterium with inputs and
outputs for substrates and by-products produced by the metabolic network. Therefore, as an
example, if one wished to simulate the absence of a particular growth substrate, one simply
constrains the corresponding transport fluxes allowing the metabolite to enter the cell to be zero
by setting $\beta_j$ and $\alpha_j$ to zero in Equation 4. On the other hand, if a substrate is only allowed
to enter or exit the cell via transport mechanisms, the corresponding fluxes can be properly
constrained to reflect this scenario.
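
The growth objective, the flux bounds of Equation 4, and the simulation of a genetic deletion can be sketched together on a toy network. Everything specific below is an assumption made for illustration: the five reactions, the bounds, the gene names g1 to g3, and the helper names. To keep the sketch free of external libraries, the linear program is solved by brute-force enumeration of basic solutions, which stands in for a real linear programming solver and is practical only for very small networks with finite bounds.

```python
# Toy flux balance analysis with gene deletions (illustrative sketch only).
from fractions import Fraction
from itertools import combinations, product

def solve(A, b):
    """Solve the square system A x = b exactly; return None if A is singular."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(A, b)]
    for c in range(n):
        p = next((r for r in range(c, n) if M[r][c] != 0), None)
        if p is None:
            return None
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [x / pivot for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                factor = M[r][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return [row[-1] for row in M]

def max_flux(S, lb, ub, target):
    """Maximize v[target] subject to S v = 0 and lb <= v <= ub.

    Fixes all but n_bal fluxes at one of their bounds, solves the balances
    for the rest, and keeps the best solution that respects every bound.
    Assumes S has full row rank and all bounds are finite.
    """
    n_bal, n_flux = len(S), len(S[0])
    best = None
    for free in combinations(range(n_flux), n_bal):
        fixed = [j for j in range(n_flux) if j not in free]
        A = [[S[i][j] for j in free] for i in range(n_bal)]
        for values in product(*[(lb[j], ub[j]) for j in fixed]):
            b = [-sum(S[i][j] * val for j, val in zip(fixed, values))
                 for i in range(n_bal)]
            x = solve(A, b)
            if x is None:
                break                       # singular choice of free fluxes
            v = [Fraction(0)] * n_flux
            for j, val in zip(fixed, values):
                v[j] = Fraction(val)
            for j, val in zip(free, x):
                v[j] = val
            feasible = all(lb[j] <= v[j] <= ub[j] for j in range(n_flux))
            if feasible and (best is None or v[target] > best[target]):
                best = v
    return best

fluxes = ["uptake_A", "R1", "R2", "R3", "growth"]
S = [
    [1, -1, -1,  0,  0],   # A: taken up, consumed by R1 and R2
    [0,  1,  1, -1, -1],   # B: made by R1 and R2, used by R3 and growth
    [0,  0,  0,  1, -1],   # C: made by R3, used by growth
]
lb = [0, 0, 0, 0, 0]
ub = [10, 6, 6, 100, 100]                  # uptake and enzyme capacity limits
genes = {"g1": ["R1"], "g2": ["R2"], "g3": ["R3"]}

def growth_after_deletion(deleted):
    """Maximal growth flux after forcing the deleted genes' reactions to zero."""
    lo, hi = list(lb), list(ub)
    for g in deleted:
        for rxn in genes[g]:
            j = fluxes.index(rxn)
            lo[j] = hi[j] = 0               # both bounds set to zero
    target = fluxes.index("growth")
    return float(max_flux(S, lo, hi, target)[target])

for deleted in ([], ["g1"], ["g2"], ["g3"], ["g1", "g2"]):
    print(deleted or "wild type", growth_after_deletion(deleted))
```

In this toy case, deleting g1 or g2 alone only reduces growth because the other route from A to B remains, whereas deleting g3, or g1 and g2 together, drives the growth flux to zero. A deletion of the latter kind is the sort that a screen would flag for further study.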

Together the linear programming representation of the genome-specific stoichiometric 
matrix as in Equation 1 along with any general constraints placed on the fluxes in the system, 
and any of the possible objective functions completes the formulation of the in silico bacterial 






strain. The in silico strain can then be used to study theoretical metabolic capabilities by 
simulating any number of conditions and generating flux distributions through the use of linear 
programming. The process 50 of formulating the in silico strain and simulating its behavior 
using linear programming techniques terminates at an end state 66. 

Thus, by adding or removing constraints on various fluxes in the network it is possible to
(1) simulate a genetic deletion event and (2) simulate or accurately provide the network with the
metabolic resources present in its in vivo environment. Using flux balance analysis it is possible
to determine the effects of the removal or addition of particular genes and their associated
reactions to the composition of the metabolic genotype on the range of possible metabolic
phenotypes. If the removal/deletion does not allow the metabolic network to produce necessary
precursors for growth, and the cell cannot obtain these precursors from its environment, the
deleted gene(s) have potential as antimicrobial drug targets. Thus by adjusting the constraints
and defining the objective function we can explore the capabilities of the metabolic genotype
using linear programming to optimize the flux distribution through the metabolic network. This
creates what we will refer to as an in silico bacterial strain capable of being studied and
manipulated to analyze, interpret, and predict the genotype-phenotype relationship. It can be
applied to assess the effects of incremental changes in the genotype or changing environmental
conditions, and provide a tool for computer aided experimental design. It should be realized that
other types of organisms can similarly be represented in silico and still be within the scope of the
invention.

The construction of a genome specific stoichiometric matrix and in silico microbial 
strains can also be applied to the area of signal transduction. The components of signaling 
networks can be identified within a genome and used to construct a content matrix that can be 
further analyzed using various techniques to be determined in the future. 

A. Example 1: E. coli metabolic genotype and in silico model 

Using the methods disclosed in Figures 1 and 2, an in silico strain of Escherichia coli K-12 
has been constructed and represents the first such strain of a bacterium largely generated from 
annotated sequence data and from biochemical information. The genetic sequence and open 
reading frame identifications and assignments are readily available from a number of on-line 
locations (e.g., www.tigr.org). For this example we obtained the annotated sequence from the 
website of the E. coli Genome Project at the University of Wisconsin 
(http://www.genetics.wisc.edu/). Details regarding the actual sequencing and annotation of the 
sequence can be found at that site. From the genome annotation data the subset of genes 
involved in cellular metabolism was determined as described above in Figure 1, state 20, 
comprising the metabolic genotype of the particular strain of E. coli. 

Through detailed analysis of the published biochemical literature on E. coli we 
determined (1) all of the reactions associated with the genes in the metabolic genotype and (2) 
any additional reactions known to occur from biochemical data which were not represented by 
the genes in the metabolic genotype. This provided all of the necessary information to construct 
the genome specific stoichiometric matrix for E. coli K-12. 

Briefly, the E. coli K-12 bacterial metabolic genotype and more specifically the genome 
specific stoichiometric matrix contains 731 metabolic processes that influence 436 metabolites 
(dimensions of the genome specific stoichiometric matrix are 436 x 731). There are 80 reactions 
present in the genome specific stoichiometric matrix that do not have a genetic assignment in the 
annotated genome, but are known to be present from biochemical data. The genes contained 
within this metabolic genotype are shown in Table 1 along with the corresponding reactions they 
carry out. 
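
To make the matrix dimensions above concrete, the following minimal sketch assembles a stoichiometric matrix with one row per metabolite and one column per reaction. The three reactions, their names, and their coefficients are hypothetical and are not taken from Table 1; the function name is likewise an assumption made for illustration.

```python
# Toy reaction list (hypothetical). Negative coefficients are consumed,
# positive coefficients are produced.
reactions = {
    "uptake_glc": {"glc": 1},
    "glc_to_g6p": {"glc": -1, "atp": -1, "g6p": 1, "adp": 1},
    "g6p_to_f6p": {"g6p": -1, "f6p": 1},
}


def build_stoichiometric_matrix(reactions):
    """Return (metabolites, reaction_ids, S), where S[i][j] is the
    coefficient of metabolite i in reaction j."""
    reaction_ids = list(reactions)
    metabolites = sorted({m for stoich in reactions.values() for m in stoich})
    row = {m: i for i, m in enumerate(metabolites)}
    S = [[0] * len(reaction_ids) for _ in metabolites]
    for j, rid in enumerate(reaction_ids):
        for met, coeff in reactions[rid].items():
            S[row[met]][j] = coeff
    return metabolites, reaction_ids, S


metabolites, reaction_ids, S = build_stoichiometric_matrix(reactions)
print(len(metabolites), "x", len(reaction_ids))  # metabolites x reactions
```

Applied to the full E. coli K-12 reaction list, the same construction would give the 436 x 731 matrix described above.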

Because E. coli is arguably the best studied organism, it was possible to determine the 
uptake rates and maintenance requirements (state 62 of Figure 2) by reference to the published 
literature. This in silico strain accounts for the metabolic capabilities of E. coli. It includes 
membrane transport processes, the central catabolic pathways, utilization of alternative carbon 
sources and the biosynthetic pathways that generate all the components of the biomass. In the 
case of E. coli K-12, we can call upon the wealth of data on overall metabolic behavior and 
detailed biochemical information about the in vivo genotype to which we can compare the 
behavior of the in silico strain. One utility of FBA is the ability to learn about the physiology of 
the particular organism and explore its metabolic capabilities without any specific biochemical 
data. This ability is important considering possible future scenarios in which the only data that 
we may have for a newly discovered bacterium (perhaps pathogenic) could be its genome 
sequence. 






B. Example 2: in silico deletion analysis for E. coli to find antimicrobial targets 

Using the in silico strain constructed in Example 1, the effect of individual deletions of 
all the enzymes in central metabolism can be examined in silico. For the analysis to determine 
sensitive linkages in the metabolic network of E. coli, the objective function utilized is the 
maximization of the biomass yield. This is defined as a flux draining the necessary biosynthetic 
precursors in the appropriate ratios. This flux is defined by the biomass composition, which can 
be determined from the literature. See Neidhardt et al., Escherichia coli and Salmonella: 
Cellular and Molecular Biology, Second Edition, ASM Press, Washington D.C., 1996. Thus, the 
objective function is the maximization of a single flux, this biosynthetic flux. 

Constraints are placed on the network to account for the availability of substrates for the 
growth of E. coli. In the initial deletion analysis, growth was simulated in an aerobic glucose 
minimal media culture. Therefore, the constraints are set to allow for the components included 
in the media to be taken up. The specific uptake rate can be included if the value is known; 
otherwise, an unlimited supply can be provided. The uptake rates of glucose and oxygen have 
been determined for E. coli (Neidhardt et al., Escherichia coli and Salmonella: Cellular and 
Molecular Biology, Second Edition, ASM Press, Washington D.C., 1996). Therefore, these 
values are included in the analysis. The uptake rates for the phosphate, sulfur, and nitrogen 
sources are not precisely known, so constraints on the fluxes for the uptake of these important 
substrates are not included, and the metabolic network is allowed to take up any required 
amount of these substrates. 

The results showed that a high degree of redundancy exists in central intermediary 
metabolism during growth in glucose minimal media, which is related to the interconnectivity of 
the metabolic reactions. Only a few metabolic functions were found to be essential such that their 
loss removes the capability of cellular growth on glucose. For growth on glucose, the essential 
gene products are involved in the 3-carbon stage of glycolysis, three reactions of the TCA cycle, 
and several points within the PPP. Deletions in the 6-carbon stage of glycolysis result in a 
reduced ability to support growth due to the diversion of additional flux through the PPP. 
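
The logic of a single-deletion screen can be illustrated with a small sketch. Note that this is not flux balance analysis: it swaps in a simpler reachability check (can every biomass precursor still be produced from the medium?) in place of the linear programming step, so it can flag lethal deletions but cannot reproduce reduced growth yields. The network, reaction names, and metabolite names are hypothetical.

```python
# Hypothetical toy network: reaction id -> (substrates, products).
REACTIONS = {
    "r_upper_a": ({"glucose"}, {"triose"}),    # two redundant routes
    "r_upper_b": ({"glucose"}, {"triose"}),
    "r_lower": ({"triose"}, {"pyruvate"}),     # single, non-redundant step
    "r_biosyn": ({"pyruvate"}, {"precursor"}),
}
MEDIUM = {"glucose"}
BIOMASS_PRECURSORS = {"precursor"}


def producible(reactions, medium):
    """Expand the set of metabolites reachable from the medium."""
    available = set(medium)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions.values():
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return available


def is_essential(reaction_id):
    """Score a reaction as essential if deleting it blocks any precursor."""
    remaining = {r: v for r, v in REACTIONS.items() if r != reaction_id}
    return not BIOMASS_PRECURSORS <= producible(remaining, MEDIUM)


essential = sorted(r for r in REACTIONS if is_essential(r))
print(essential)  # ['r_biosyn', 'r_lower']
```

The two redundant upper reactions are individually dispensable, while the non-redundant steps are scored essential, which mirrors the qualitative pattern of redundancy reported above.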






The results from the gene deletion study can be directly compared with growth data from 
mutants. The growth characteristics of a series of E. coli mutants on several different carbon 
sources were examined (80 cases were determined from the literature), and compared to the in 
silico deletion results (Table 2). The majority (73 of 80 cases or 91%) of the mutant 
experimental observations are consistent with the predictions of the in silico study. The results 
from the in silico gene deletion analysis are thus consistent with experimental observations. 
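
The agreement figure quoted above is a simple count of matching scores. A minimal sketch of that arithmetic, assuming each case is recorded as an "in vivo/in silico" pair of + or - signs as in Table 2 (the helper name and the four sample entries are illustrative only):

```python
def agreement(cells):
    """Count entries of the form 'in vivo/in silico' whose two scores match."""
    matches = sum(1 for c in cells if c.split("/")[0] == c.split("/")[1])
    return matches, len(cells)


sample = ["-/+", "+/+", "-/-", "+/+"]  # illustrative entries only
matches, total = agreement(sample)
print(matches, "of", total)            # 3 of 4

# Overall figure reported in the text: 73 of 80 cases.
print(round(100 * 73 / 80, 2))         # 91.25
```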

C. Example 3: Prediction of genome scale shifts in gene expression 

Flux based analysis can be used to predict metabolic phenotypes under different growth 
conditions, such as substrate and oxygen availability. The relation between the flux value and 
the gene expression levels is non-linear, resulting in bifurcations and multiple steady states. 
However, FBA can give qualitative (on/off) information as well as the relative importance of 
gene products under a given condition. Based on the magnitude of the metabolic fluxes, 
qualitative assessment of gene expression can be inferred. 

Figure 3a shows the five phases of distinct metabolic behavior of E. coli in response to 
varying oxygen availability, going from completely anaerobic (phase I) to completely aerobic 
(phase V). Figures 3b and 3c display lists of the genes that are predicted to be induced or 
repressed upon the shift from aerobic growth (phase V) to nearly complete anaerobic growth 
(phase II). The numerical values shown in Figures 3b and 3c are the fold change in the 
magnitude of the fluxes calculated for each of the listed enzymes. 
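
A minimal sketch of how such fold changes and qualitative on/off calls could be derived from two computed flux distributions is given below. The enzyme names and flux values are hypothetical and are not the values shown in Figures 3b and 3c; the function name and tolerance are assumptions made for illustration.

```python
def classify_flux_change(flux_before, flux_after, tol=1e-9):
    """Compare flux magnitudes between two conditions, reaction by reaction.

    Returns {reaction: (label, fold_change)}; fold_change is None when the
    flux is zero in the first condition and the ratio is undefined."""
    result = {}
    for rxn in sorted(set(flux_before) | set(flux_after)):
        before = abs(flux_before.get(rxn, 0.0))
        after = abs(flux_after.get(rxn, 0.0))
        if before <= tol and after <= tol:
            result[rxn] = ("inactive", None)
        elif before <= tol:
            result[rxn] = ("switched on", None)
        elif after <= tol:
            result[rxn] = ("switched off", 0.0)
        else:
            fold = after / before
            if fold > 1:
                label = "induced"
            elif fold < 1:
                label = "repressed"
            else:
                label = "unchanged"
            result[rxn] = (label, fold)
    return result


# Hypothetical flux values for an aerobic and a low-oxygen condition.
aerobic = {"enzyme_a": 8.0, "enzyme_b": 0.0, "enzyme_c": 10.0}
low_oxygen = {"enzyme_a": 2.0, "enzyme_b": 5.0, "enzyme_c": 15.0}
for rxn, (label, fold) in classify_flux_change(aerobic, low_oxygen).items():
    print(rxn, label, fold)
```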

For this example, the objective of maximization of biomass yield is utilized (as described 
above). The constraints on the system are also set accordingly (as described above). However, 
in this example, a change in the availability of a key substrate leads to changes in the 
metabolic behavior. The change in this parameter is reflected as a change in the uptake flux. 
Therefore, the maximal allowable oxygen uptake rate is changed to generate these data. The 
figure demonstrates how several fluxes in the metabolic network will change as the oxygen 
uptake flux is continuously decreased. The constraints on the fluxes are therefore identical to 
those described in the previous section, except that the oxygen uptake rate is set to coincide with 
the corresponding point in the diagram. 
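
The effect of lowering the maximal allowable oxygen uptake rate can be illustrated with a deliberately tiny model. The sketch below is an assumption-laden toy, not the E. coli model: it has only two routes for glucose (respiration and fermentation), it maximizes ATP production as a stand-in for the biomass objective, and the ATP yields are hypothetical round numbers. Because the toy linear program has only two fluxes, its optimum can be written in closed form instead of calling a solver.

```python
ATP_FERMENTATION = 2.0   # hypothetical ATP per glucose, fermentative route
ATP_RESPIRATION = 30.0   # hypothetical ATP per glucose, respiratory route
O2_PER_GLUCOSE = 6.0     # O2 consumed per glucose fully oxidized


def optimal_fluxes(max_glucose, max_oxygen):
    """Closed-form optimum of the two-route toy problem:
    maximize ATP subject to resp + ferm <= max_glucose and
    O2_PER_GLUCOSE * resp <= max_oxygen, with both fluxes >= 0.
    Respiration yields more ATP per glucose, so it is used up to the
    oxygen limit and the remaining glucose is fermented."""
    resp = min(max_glucose, max_oxygen / O2_PER_GLUCOSE)
    ferm = max_glucose - resp
    atp = ATP_RESPIRATION * resp + ATP_FERMENTATION * ferm
    return {"respiration": resp, "fermentation": ferm, "atp": atp}


# Sweep the oxygen uptake limit downward, as in the example above.
for max_oxygen in (60.0, 30.0, 0.0):
    print(max_oxygen, optimal_fluxes(10.0, max_oxygen))
```

As the oxygen bound is tightened, flux shifts from the respiratory route to the fermentative route, which is the kind of flux redistribution the varying-oxygen analysis is intended to expose.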






Corresponding experimental data sets are now becoming available. Using high-density 
oligonucleotide arrays the expression levels of nearly every gene in Saccharomyces cerevisiae 
can now be analyzed under various growth conditions. From these studies it was shown that 
nearly 90% of all yeast mRNAs are present in growth on rich and minimal media, while a large 
number of mRNAs were shown to be differentially expressed under these two conditions. 
Another recent article shows how the metabolic and genetic control of gene expression can be 
studied on a genomic scale using DNA microarray technology (Exploring the Metabolic and 
Genetic Control of Gene Expression on a Genomic Scale, Science, Vol. 278, October 24, 1997). 
The temporal changes in genetic expression profiles that occur during the diauxic shift in S. 
cerevisiae were observed for every known expressed sequence tag (EST) in this genome. As 
shown above, FBA can be used to qualitatively simulate shifts in metabolic genotype expression 
patterns due to alterations in growth environments. Thus, FBA can serve to complement current 
studies in metabolic gene expression, by providing a fundamental approach to analyze, interpret, 
and predict the data from such experiments. 

D. Example 4: Design of defined media 

An important economic consideration in large-scale bioprocesses is optimal medium 
formulation. FBA can be used to design such media. Following the approach defined above, a 
flux-balance model for the first completely sequenced free living organism, Haemophilus 
influenzae, has been generated. One application of this model is to predict a minimal defined 
medium. It was found that H. influenzae can grow on the minimal defined medium as determined 
from the ORF assignments and predicted using FBA. Simulated bacterial growth was predicted 
using the following defined medium: fructose, arginine, cysteine, glutamate, putrescine, 
spermidine, thiamin, NAD, tetrapyrrole, pantothenate, ammonia, and phosphate. This predicted 
minimal medium was compared to the previously published defined media and was found to 
differ in only one compound, inosine. It is known that inosine is not required for growth; 
however, it does serve to enhance growth. Again the in silico results obtained were consistent 
with published in vivo research. These results provide confidence in the use of this type of 
approach for the design of defined media for organisms for which a defined medium does not 
currently exist. 
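
One way to search for a minimal medium in silico is to start from a rich candidate list and drop each component whose removal still permits growth. The sketch below illustrates that procedure under stated assumptions: the growth test is a simple reachability check used as a stand-in for the FBA growth simulation, and the network and compound names are hypothetical and are not the H. influenzae model.

```python
# Hypothetical toy network: (substrates, products) per reaction.
REACTIONS = [
    ({"sugar"}, {"carbon_backbone"}),
    ({"carbon_backbone", "ammonia"}, {"amino_acid_x"}),  # can be synthesized
    ({"carbon_backbone", "amino_acid_x", "vitamin"}, {"biomass"}),
]
CANDIDATE_MEDIUM = ["amino_acid_x", "sugar", "ammonia", "vitamin"]


def can_grow(medium):
    """Stand-in growth test: is 'biomass' reachable from the medium?"""
    available = set(medium)
    changed = True
    while changed:
        changed = False
        for substrates, products in REACTIONS:
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return "biomass" in available


def reduce_medium(candidates):
    """Drop each component in turn; keep it only if growth fails without it."""
    medium = list(candidates)
    for component in candidates:
        trial = [m for m in medium if m != component]
        if can_grow(trial):
            medium = trial
    return medium


print(reduce_medium(CANDIDATE_MEDIUM))  # ['sugar', 'ammonia', 'vitamin']
```

A single pass of this kind yields a medium from which nothing further can be removed, but the result can depend on the order in which components are tried. A supplement that the network can synthesize for itself is dropped, much as inosine is absent from the predicted medium described above.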






While particular embodiments of the invention have been described in detail, it will be 
apparent to those skilled in the art that these embodiments are exemplary rather than limiting, and 
the true scope of the invention is defined by the claims that follow. 





Table 2 

Comparison of the predicted mutant growth characteristics from the gene deletion study 
to published experimental results with single and double mutants. 



Gene          Glucose     Glycerol    Succinate   Acetate
              (each entry: in vivo/in silico)

aceEF         -/+
aceA
aceB                                              -/-
ackA                                              +/+
acs                                               +/+
acn           -/-         -/-         -/-         -/-
cyd           +/+
cyo           +/+
eno           -/+         -/+         -/-         -/-
fba           -/+
fbp           +/+         -/-         -/-         -/-
gap                                   -/-         -/-
glt                                   -/-         -/-
gnd           +/+
idh           -/-         -/-                     -/-
ndh           +/+         +/+
nuo           +/+         +/+
pfk           -/+
pgi           +/+         +/+
pgk           -/-         -/-         -/-         -/-
pgl           +/+
pntAB         +/+         +/+         +/+         +/+
glk           +/+
ppc           +/+         -/+         +/+         +/+
pta
pts           +/+
pyk           +/+
rpi           -/-                     -/-         -/-
sdhABCD       +/+
tpi           -/+                     -/-         -/-
unc           +/+                                 -/-
zwf           +/+
sucAD         +/+
zwf, pnt      +/+
pck, mez                              -/-         -/-
pck, pps                              -/-         -/-
pgk zwf       -/-
pgi, gnd      -/-
pta, acs                                          -/-
tktA, tktB    -/-









Results are scored as + or - meaning growth or no growth determined from in vivo / in silico data. In 73 
of 80 cases the in silico behavior is the same as the experimentally observed behavior. 






